GH-48202: [C++][Parquet] Fix encoder & decoder logic to enable Parque… #48203

Vishwanatha-HD · 2025-11-21T13:54:05Z

…t DB support on s390x

Rationale for this change

This PR is intended to enable Parquet DB support on big-endian (s390x) systems. It fixes the encoder and decoder logic. Encoders and decoders are central to most of the parquet and arrow-parquet test cases, and they need fixes for various encoding and decoding types.

What changes are included in this PR?

The fix includes changes to the following files:
cpp/src/parquet/decoder.cc
cpp/src/parquet/encoder.cc

Are these changes tested?

Yes. The changes were tested on s390x to confirm that they work as intended, and on x86 to confirm that no regression is introduced.

Are there any user-facing changes?

No

GitHub main Issue link: #48151

GitHub Issue: [C++][Parquet] Fix encoder and decoder logic to enable Parquet DB support on Big-Endian (s390x) systems #48202

github-actions · 2025-11-21T13:54:28Z

⚠️ GitHub issue #48202 has been automatically assigned in GitHub to PR creator.

k8ika0s · 2025-11-23T21:50:51Z

@Vishwanatha-HD Hey! Really appreciate you taking on the encoder/decoder paths for s390x — these two files are where a lot of the subtle BE issues first show up.

One thing I ran into on real s390x hardware is that Arrow’s array buffers already store their scalars in canonical little-endian format. Because of that, per-value swapping inside the Plain/Arrow fast paths can sometimes lead to an unintended double-swap, especially when mixing Arrow-originated inputs with non-Arrow callers (e.g., DeltaBitPack or ByteStreamSplit feeding into the same decode path).
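The double-swap concern can be illustrated in isolation. The sketch below is not taken from decoder.cc or encoder.cc: `ByteSwap32`, `LoadLE32`, and `CheckDoubleSwap` are hypothetical helpers, and the only assumption is that the input bytes are already in little-endian order. It shows that a shift-based load is correct on any host, and that one extra swap applied to an already-decoded value silently corrupts it.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper (not an Arrow API): reverse the four bytes of a value.
std::uint32_t ByteSwap32(std::uint32_t v) {
  return ((v & 0x000000FFu) << 24) | ((v & 0x0000FF00u) << 8) |
         ((v & 0x00FF0000u) >> 8) | ((v & 0xFF000000u) >> 24);
}

// Hypothetical helper: read a little-endian uint32 from raw bytes.
// It uses shifts rather than memcpy, so the result is the same on
// little-endian and big-endian hosts.
std::uint32_t LoadLE32(const std::uint8_t* p) {
  return static_cast<std::uint32_t>(p[0]) |
         (static_cast<std::uint32_t>(p[1]) << 8) |
         (static_cast<std::uint32_t>(p[2]) << 16) |
         (static_cast<std::uint32_t>(p[3]) << 24);
}

// Demonstrates the failure mode: the first conversion is correct,
// a second swap on the same value is not.
void CheckDoubleSwap() {
  const std::uint8_t wire[4] = {0x78, 0x56, 0x34, 0x12};  // 0x12345678 as LE bytes
  const std::uint32_t decoded = LoadLE32(wire);
  assert(decoded == 0x12345678u);

  // A caller that assumes "not yet converted" and swaps again gets garbage.
  const std::uint32_t swapped_again = ByteSwap32(decoded);
  assert(swapped_again == 0x78563412u);
  assert(swapped_again != decoded);
}
```

Whether a given path in this PR hits that case depends on which byte order each caller actually hands to the decoder, which the sketch does not try to establish.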

A couple of spots I’m curious about here:

• PlainDecoder → Arrow fast path
The #if ARROW_LITTLE_ENDIAN branches check out for host-native buffers, but Arrow itself always hands you LE data. Does this path avoid re-swapping values that are already canonical LE from Arrow? I’ve seen that cause subtle mismatches on BE when DeltaBitPack or BSS push Arrow-arrays directly into the decoder.

• PlainEncoder (primitive path)
On BE, the per-value ToLittleEndian write works for correctness, though I’ve found cases where staging to a single LE scratch buffer helps avoid partial/mixed-endian outputs when builders and sinks run back-to-back.

• ByteStreamSplit
Here the code assumes DoSplitStreams handles endianness, but BSS usually expects inputs to already be in canonical LE order before the streams are interleaved. With native-order buffers on BE, the shuffle sometimes produces different stats/dictionary bytes across architectures. Curious if you’ve tested mixed Arrow/non-Arrow inputs through this path?
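As a rough sketch of the staging and ByteStreamSplit points above: the helpers below, `StageLittleEndian` and `SplitStreams`, are hypothetical and are not the functions in encoder.cc. They assume 32-bit values and that the split runs over bytes that are already in little-endian order, which is how I read the Parquet encoding description.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: serialize all values into one little-endian scratch
// buffer. Shifts make the output identical on LE and BE hosts.
std::vector<std::uint8_t> StageLittleEndian(const std::vector<std::uint32_t>& values) {
  std::vector<std::uint8_t> out;
  out.reserve(values.size() * sizeof(std::uint32_t));
  for (std::uint32_t v : values) {
    for (std::size_t b = 0; b < sizeof(std::uint32_t); ++b) {
      out.push_back(static_cast<std::uint8_t>((v >> (8 * b)) & 0xFFu));
    }
  }
  return out;
}

// Hypothetical helper: byte-stream-split over already-LE bytes.
// Byte k of every value goes to stream k; the streams are concatenated.
std::vector<std::uint8_t> SplitStreams(const std::vector<std::uint8_t>& le_bytes,
                                       std::size_t width) {
  assert(width > 0 && le_bytes.size() % width == 0);
  const std::size_t n = le_bytes.size() / width;  // number of values
  std::vector<std::uint8_t> out(le_bytes.size());
  for (std::size_t i = 0; i < n; ++i) {
    for (std::size_t k = 0; k < width; ++k) {
      out[k * n + i] = le_bytes[i * width + k];
    }
  }
  return out;
}
```

With the values `0x11223344` and `0xAABBCCDD`, staging first and splitting second yields the same byte sequence on both architectures, so stats or dictionary bytes derived from it would not differ by host.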

None of this blocks the PR — just sharing things I hit in BE testing across the encode/decode → stats → page-index chain.

Vishwanatha-HD · 2025-11-24T13:30:32Z

@k8ika0s Thanks for your review comments. While I agree with you that, from a structural point of view, things could be a lot better, please note that I have tested my changes on s390x systems and also on OpenShift AI workloads, and they work properly. Hence there is no concern with these changes.

cpp/src/parquet/decoder.cc

cpp/src/parquet/encoder.cc

…Parquet DB support on s390x

Vishwanatha-HD

I have addressed all the review comments. Thanks

Vishwanatha-HD requested a review from wgtmac as a code owner November 21, 2025 13:54

github-actions bot added the Component: Parquet, Component: C++, and awaiting review labels Nov 21, 2025

Vishwanatha-HD mentioned this pull request Nov 21, 2025

[C++][Parquet] Fix encoder and decoder logic to enable Parquet DB support on Big-Endian (s390x) systems #48202

Open

k8ika0s mentioned this pull request Nov 21, 2025

GH-48213: [C++][Parquet] Fix endianness and test failures on s390x (big-endian) (supersedes partial fixes) #48212

Closed

Vishwanatha-HD mentioned this pull request Nov 21, 2025

[C++][Parquet] Enable Parquet DB support on Big Endian (IBM Z) systems #48151

Open

Vishwanatha-HD force-pushed the fixEncoderDecoder branch from 4efffd9 to 8c99313 Compare November 22, 2025 04:59